knitr::opts_chunk$set(echo = TRUE)

Task 1

Get working directory

getwd()

Task 2

spruce.df = read.csv("SPRUCE.csv")
tail(spruce.df)

Task 3

Lowess smoother (Height vs BHDiameter)

library(s20x)
trendscatter(Height~BHDiameter,f=0.5,data=spruce.df)

Create spruce.lm, height.res, and height.fit

spruce.lm <- with(spruce.df, lm(Height~BHDiameter))
height.res <- resid(spruce.lm)
height.fit <- fitted(spruce.lm)

Plot residuals vs fitted

plot(height.fit, height.res, xlab="Fitted values", ylab="Residuals")

Plot residuals vs fitted using trendscatter()

trendscatter(height.fit, height.res)

The first trendscatter plot (Height vs. BHDiameter) shows a roughly linear trend, while the residuals-vs-fitted plot appears quadratic, which suggests the straight-line model is missing some curvature.

Make the residual plot

plot(spruce.lm, which =1)
normcheck(spruce.lm,shapiro.wilk = TRUE)

The p-value is 0.29.

Since the p-value > .05, we cannot reject the null hypothesis that the residuals are normally distributed.

When we fit a straight line to this data set, we are assuming that the errors (residuals) are normally distributed. Here the Shapiro-Wilk p-value is greater than .05, so there is no evidence against that assumption. I therefore conclude that we can apply a straight line to the data set with acceptable confidence that the fit is valid.
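As a quick cross-check (a minimal sketch; normcheck() already reports this value), the Shapiro-Wilk test can be run directly on the residuals, and the p-value should match the 0.29 reported above:

shapiro.test(height.res)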

Task 4

# Make a quadratic model
quad.lm=lm(Height~BHDiameter + I(BHDiameter^2),data=spruce.df)

# Find the coefficients
coef(quad.lm)

# Make a function that produces predicted heights for input x,
# using the coefficients fitted above
myplot=function(x){
  coef(quad.lm)[1] + coef(quad.lm)[2]*x + coef(quad.lm)[3]*x^2
}
plot(Height~BHDiameter, data=spruce.df)
# add the quadratic curve to the points
curve(myplot, lwd=2, col="steelblue", add=TRUE)

quad.fit = fitted(quad.lm)

plot(quad.lm, which =1)

normcheck(quad.lm, shapiro.wilk = TRUE)

The p-value is 0.684.

Since the p-value > .05, we again fail to reject the null hypothesis that the residuals are normally distributed, so the quadratic model is acceptable on these grounds.
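The same cross-check can be run on the quadratic model's residuals:

shapiro.test(resid(quad.lm))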

Task 5

Summarize quad.lm

summary(quad.lm)

b0, b1, b2

b0 = 0.8609

b1 = 1.4696

b2 = -0.0275

Make interval estimates

ciReg(quad.lm)

Write down the equation of the fitted line

$\hat{y} = 0.8609 + 1.4696x - 0.0275x^2$
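As a quick worked check, plugging $x = 15$ into the fitted equation gives

$\hat{y}(15) = 0.8609 + 1.4696(15) - 0.0275(15)^2 \approx 16.7$

which should match the first predict() value below.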

Predict Height when BHDiameter is 15,18,20.

predict(quad.lm, data.frame(BHDiameter = c(15,18,20)))

The predicted heights from quad.lm are higher than the corresponding predictions from spruce.lm at each of these diameters.
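To see the comparison directly, the same predictions can be generated from the straight-line model:

predict(spruce.lm, data.frame(BHDiameter = c(15,18,20)))
predict(quad.lm, data.frame(BHDiameter = c(15,18,20)))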

Multiple R squared

The value is 0.7741, which is higher than that of the previous model (0.6569).

Which is better?

quad.lm

What does multiple R^2 mean

The proportion of the variation in the dependent variable (Height) that is explained by the independent variable(s) in the model.
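Both values can be pulled straight from the model summaries:

summary(spruce.lm)$r.squared
summary(quad.lm)$r.squared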

Which model explains the most variability in Height?

quad.lm

Use anova()

anova(spruce.lm)
anova(quad.lm)
summary(quad.lm)

quad.lm has a lower residual sum of squares (RSS), so it fits the data more closely and is the more accurate of the two models.
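anova() can also be given both models at once for a direct nested-model comparison, which tests whether adding the quadratic term significantly reduces the RSS:

anova(spruce.lm, quad.lm)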

TSS, MSS, RSS, MSS/TSS

yhat = quad.fit
RSS=with(spruce.df,sum((Height-yhat)^2))
MSS=with(spruce.df,sum((yhat-mean(Height))^2))
TSS=with(spruce.df,sum((Height-mean(Height))^2))
print(paste0("TSS: ", TSS))
print(paste0("MSS: ", MSS))
print(paste0("RSS: ", RSS))
print(paste0("MSS/TSS: ", MSS/TSS))

Task 6

cooks20x(quad.lm)

What is Cook's distance?

According to "datasciencecentral.com", Cook's Distance is used in regression analysis to find outliers that may be negatively affecting our regression model. The measurement is a combination of each observation's leverage and residual value. So, the higher the leverage and residuals, the higher the Cook's distance.

What does the Cook's distance plot for quad.lm tell me?

My research has shown that there are several rules of thumb for flagging an influential observation with Cook's distance. The first two say an observation is influential if its distance is greater than .5 or 1, respectively; another recommends simply inspecting the plot visually. By the .5 and 1 cutoffs there would seemingly be no influential observations here, but looking at the graph there is clearly what looks to be one influential observation (24) and a couple of possible ones (18 and 21).
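The distances behind the plot can be checked numerically (a quick sketch; the top rows are expected to be the observations flagged above):

cd <- cooks.distance(quad.lm)
sort(cd, decreasing = TRUE)[1:3]  # expect observations 24, 18, and 21 at the top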

quad2.lm and reanalyze

quad2.lm=lm(Height~BHDiameter + I(BHDiameter^2) , data=spruce.df[-24,])
summary(quad2.lm)
summary(quad.lm)

Nearly all of the fit metrics are slightly larger for quad2.lm, though not by much. So observation 24 clearly had some influence on the fit, but I don't think it was having a serious negative impact on the original model.
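The headline fit statistics can be compared side by side:

summary(quad.lm)$r.squared
summary(quad2.lm)$r.squared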

Task 7

Proof

Start with two line equations and a common $x_k$ knot point:

$L_1: y = \beta_0 + \beta_1x$

$L_2: y = \beta_0 + \delta + (\beta_1 + \beta_2)x$

Since $x_k$ is common to both lines, we will plug it in:

$y_k = \beta_0 + \beta_1x_k = \beta_0 + \delta + (\beta_1 + \beta_2)x_k$

$\quad \quad \beta_0 + \beta_1x_k = \beta_0 + \delta + \beta_1x_k + \beta_2x_k$

$\quad \quad 0 = \delta + \beta_2 x_k$

$\quad \quad \delta = -\beta_2 x_k$

Now we look at $L_2$:

$L_2 : y = \beta_0 + \delta + (\beta_1 + \beta_2)x$

Replace $\delta$ with what we found earlier:

$L_2 : y = \beta_0 -\beta_2 x_k + (\beta_1 + \beta_2)x$

$\quad\quad y = \beta_0 - \beta_2x_k + \beta_1x + \beta_2x$

$\quad\quad y = \beta_0 + \beta_1x + \beta_2x - \beta_2x_k$

$\quad\quad y = \beta_0 + \beta_1x + \beta_2(x-x_k)$

We can easily see that this is the equation of $L_1$ with the added term $\beta_2(x-x_k)$.

Now we define the indicator function $I(x>x_k)$, which returns 1 if $x > x_k$ and 0 otherwise.

Thus, we have $y = \beta_0 + \beta_1x + \beta_2(x-x_k)I(x>x_k) \square$

Reproduce the graph

coef(spruce.lm)
## piecewise linear model in R
## Model y = b0 + b1x + b2(x-xk)*(x>xk)
## You will need to change the code appropriately
sp2.df=within(spruce.df, X<-(BHDiameter-18)*(BHDiameter>18)) # this makes a new variable and places it within the same df


lmp=lm(Height~BHDiameter + X,data=sp2.df)
tmp=summary(lmp)
myf = function(x,coef){
  coef[1]+coef[2]*(x) + coef[3]*(x-18)*(x-18>0)
}
plot(Height~BHDiameter, data=spruce.df, main="Piecewise regression")
myf(0, coef=tmp$coefficients[,"Estimate"]) # sanity check: the piecewise function evaluated at x = 0
curve(myf(x,coef=tmp$coefficients[,"Estimate"] ),add=TRUE, lwd=2,col="Blue")
abline(v=18)
text(18,16,paste("R sq.=",round(tmp$r.squared,4) ))

Task 8

myread function

library(grac0009MATH4753)
mydata <- grac0009MATH4753::myread("SPRUCE.csv")
head(mydata, 4)

The function simply takes the name of a .csv file in the current working directory and returns the data as a data frame.
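The package source is not shown here, but a minimal sketch of what such a wrapper presumably looks like (the body below is an assumption, not the packaged code):

myread <- function(csv){
  read.csv(csv, header = TRUE)  # assumed implementation; the packaged version may differ
}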

Extra

library(grac0009MATH4753)
spruce.df = grac0009MATH4753::myread("SPRUCE.csv")
quad.lm=lm(Height~BHDiameter + I(BHDiameter^2),data=spruce.df)
quad.fit = fitted(quad.lm)
# quadratic prediction function, using the fitted coefficients
myplot=function(x){
  coef(quad.lm)[1] + coef(quad.lm)[2]*x + coef(quad.lm)[3]*x^2
}
plot(Height~BHDiameter, data=spruce.df, main="Spruce height prediction",
     bg="blue", pch=21, cex=1.1,
     xlim=c(0, max(BHDiameter)), ylim=c(0, max(Height)))
# add the quadratic to the points
curve(myplot, lwd=2, col="steelblue",add=TRUE)
# draw residual segments from each point to the fitted curve
with(spruce.df, segments(BHDiameter,Height,BHDiameter,myplot(BHDiameter),col="Black"))
# highlight the three observations with the largest Cook's distances
lgcooks = spruce.df[c(18,21,24),]
with(lgcooks, segments(BHDiameter,Height,BHDiameter,myplot(BHDiameter),col="red",lwd = 3))
# label every point with its row number
with(spruce.df, text(Height~BHDiameter, labels=row.names(spruce.df), pos = 4, cex = 0.5))
# draw an arrow pointing at the observation with the highest Cook's distance
with(spruce.df, arrows(BHDiameter[18], Height[24], BHDiameter[24], Height[24], col="Blue", lwd = 2))
with(spruce.df, text(BHDiameter[18], Height[24], labels = c("Highest Cook's\ndistance"),pos=2,cex=1))

